Clustering Narrow-Domain Short Texts by Using the Kullback-Leibler Distance

نویسندگان

  • David Pinto
  • José-Miguel Benedí
  • Paolo Rosso
چکیده

Clustering short length texts is a difficult task itself, but adding the narrow domain characteristic poses an additional challenge for current clustering methods. We addressed this problem with the use of a new measure of distance between documents which is based on the symmetric Kullback-Leibler distance. Although this measure is commonly used to calculate a distance between two probability distributions, we have adapted it in order to obtain a distance value between two documents. We have carried out experiments over two different narrowdomain corpora and our findings indicates that it is possible to use this measure for the addressed problem obtaining comparable results than those which use the Jaccard similarity measure.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Using Kullback-Leibler distance for performance evaluation of search designs

This paper considers the search problem, introduced by Srivastava cite{Sr}. This is a model discrimination problem. In the context of search linear models, discrimination ability of search designs has been studied by several researchers. Some criteria have been developed to measure this capability, however, they are restricted in a sense of being able to work for searching only one possibl...

متن کامل

Model Confidence Set Based on Kullback-Leibler Divergence Distance

Consider the problem of estimating true density, h(.) based upon a random sample X1,…, Xn. In general, h(.)is approximated using an appropriate in some sense, see below) model fƟ(x). This article using Vuong's (1989) test along with a collection of k(> 2) non-nested models constructs a set of appropriate models, say model confidence set, for unknown model h(.).Application of such confide...

متن کامل

Comparison of Kullback-Leibler, Hellinger and LINEX with Quadratic Loss Function in Bayesian Dynamic Linear Models: Forecasting of Real Price of Oil

In this paper we intend to examine the application of Kullback-Leibler, Hellinger and LINEX loss function in Dynamic Linear Model using the real price of oil for 106 years of data from 1913 to 2018 concerning the asymmetric problem in filtering and forecasting. We use DLM form of the basic Hoteling Model under Quadratic loss function, Kullback-Leibler, Hellinger and LINEX trying to address the ...

متن کامل

Evaluating the Improvement of Partial Discharge Localization Accuracy Using Frequency Response Assurance Criterion

Partial Discharge (PD) is the most important source of insulation degradation in power transformers. In order to prevent catastrophic failures in transformers, PDs need to be located as soon as possible so that maintenance measures can be taken in time. Due to the structural complexity of windings, locating the PD source inside a transformer winding is not a simple task. In this paper, the effi...

متن کامل

Geometric Clustering of Multimedia Databases

Many kinds of texts are now available in various types of databases, and it has been requested to develop new methods to fully utilize them in a wide range of applications. In the eld of information retrieval of full text databases, the vector-space model has been developed over 20 years (Salton et al. [14, 13]), and further the so-called latent semantic indexing based on the singular value dec...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2007